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Abstract 

Cluster- Weighted Modeling (CWM) is a flexible mixture approach for modeling the 
joint probability of data coming from a heterogeneous population as a weighted sum of 
the products of marginal distributions and conditional distributions. In this paper, we 
introduce a wide family of Cluster Weighted models in which the conditional distribu- 
tions are assumed to belong to the exponential family with canonical links which will 
be referred to as Generalized Linear Gaussian Cluster Weighted Models. Moreover, we 
show that, in a suitable sense, mixtures of generalized linear models can be considered 
as nested in Generalized Linear Gaussian Cluster Weighted Models. The proposal is 
illustrated through many numerical studies based on both simulated and real data sets. 
Keywords: Cluster- Weighted Modeling, Mixture Models, Model-Based Clustering, 
Generalized Linear models. 

1. Introduction 

Let (X',Y) be a random vector, defined on ft, composed by a d-dimensional 
explanatory variable X and a unidimensional response variable Y. Let p(x,y) be 
the joint distribution of (X' , Y) . Suppose that Q can be partitioned into G groups, 
say fii, . . . , rtQ. Cluster-Weighted Modeling (CWM) represents a convenient mixture 
approach for modeling data of this type. Indeed, in this approach, the joint probability 
of (X 1 , y)' can be written as 

G G 

p(x,y;0) = ^2p(y\x,n g )p(x\n g )p(n g ) = ^2p(y\x;€ g )p(x;il) g )Tr g , (1) 

3=1 ff=l 
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where p(y\x, fl g ) = p(y\x; £ ) is the conditional distribution of Y given x in fl g (de- 
pending on some parameters p(x|f2 g ) = p(a;; t/j ) is the probability distribution 
of x in fig (depending on some parameters tf> g ), ir g = p(O s ) is the mixing weight of 
f2 9 (7Tg > and ^2 g TT g = 1), and = {£ g ,tp g , n g ; g = 1, . . . , G} denotes the set of 
all model parameters. 

In the original formulation of Cluster Wei ghted ModelingjCW M), the random vec 
tor (X ' , Y)' is assumed to be real-valued, see 



(1999), 



Schonerl (120001) for details). Quite recently, 



Gershenfeld ( 1997) and also 



Ingrassia et al. 



Gershenfeld 



d2012al) reformu- 



lated CWM in a statistical setting showing that it is a quite general and flexible family 
of mixture models. Also, in the majority of the existence literature about CWM, a 
linear relationship of Y on x is considered for each fl g , that is 



Y = E(Y\x, n g ) + e g = fi(x; P ) + e g = /3 g + X X 



(2) 



where f3 g — (Pog, ft'i g )' ■> with j3 g £ R d+1 , and e g is assumed to have zero mean and 
a finite variance. If in each f2 ff , a Gaussian distribution is assumed for both Y\x and 
-X", say Y\x ~ N(0, a g ) and X ~ Nd(u „, S g ), then the origi nal linear Gaussian CW 



model is obtained in 



Ingrassia et al. 



(1201 2al) 



i iGershenfeldl (1 1 9971) . For robustness sake 
also introduce linear t CWM, which is based on the assumption of a (multivariate) t 
dist ribution for both Y \ x a nd X in each mixture component. Starting from this re- 
sult, |lngrassia_efa/J ((2012bJ) define a family of twelve linear t CWMs for model-based 



cluster i ng and classification. Finally, by generalizing © via a polynomial relationship, 
Punzol d2012l) defines the polynomial Gaussian CWM. 

However, in many cases we have to face with modeling categorical variables de- 
pending on numerical covariates based on data coming from a heterogeneous popula- 
tion. For example, in effectiveness studies based on administrative data, one may be 
interested to estimate many different effectiveness models for certain groups of users, 
taking into account the characteristics of each group; in the healthcare context, one typ- 
ical outcomes are mortality rate or length of stay, while in the education framework an 
outcome may be the success in an exam. In statistical literature, such problems are usu- 
ally ap proached considering finite mixtures of g e nerali z ed linear models (FMGLM ), 



see e.g. 



McLachlan (1997) 



McLachlan and Peell (I2000h 



Wedel and De Sarbo ( 1995). 



This paper generalizes the linear Gaussian CWM by considering a generalized lin- 
ear model (with canonical link) for the relationship of Y on x in each Q, g , g = 1, . . . , G. 
In particular, we introduce a wide family of Cluster Weighted (CW) models in which 
the conditional distributions are assumed to belong to the exponential family with 
canonical links which will be referred to as Generalized Linear Gaussian Cluster 
Weighted Models (GLGCWM). We remark that while FMGLM model conditional dis- 
tributions, Cluster Weighted approaches model the joint distribution as the product of 
marginal and conditional distributions. In this paper, we show also that, in some sense, 
FMGLM can be considered as nested models of GLGCWM. 

Generalized linear models (with canonical link) are regression models where Y is 
specified to be distributed according to one of the members of the exponential fam- 
ily. Accordingly, those models deal with dependent variables Y that can be either 
continuous with, for example, Gaussian, gamma or inverse Gaussian distributions, or 
discrete with, for example, binomial or Poisson distributions. The exponential family is 
a very useful class of distributions and the common properties of the distributions in the 
class enable them to be studied simultaneously rather than as a collection of unrelated 
cases. GLGCWM thus affords a general framework that, in addition to encompass 
the linear Gaussian CWM as a special case, allow us to use the CW principle also for 
discrete dependent variables, which are very common in real applications (see, e.g.. 



Wedel and De S arbc 



199 



* 



2002, 



1993 



Yang and Lai 



We remark that also 



Wedel et al. 



2005 



1993 



Xian g et al. 



Wang and Puterman 



1998 



Wang et al. 



2005) 



Gershenfeldl (1 1 99% coped with the problem of discrete set of 



values such as events, patterns or conditions, but they did not really model the joint 
probability of the dependent variable and the covariates; what they introduced in the 
CWM is the histogram of the probabilities of each state in the clusters (without explicit 
dependence on the covariates). 

The remainder of the paper is organized as follows. In Section [2j we introduce 
the Generalized Linear Gaussian CWM (GLGCWM); in Section [3] we show some 
results about the relationships of the GLGCWM with mixture of generalized linear 
models (with and without concomitants); in Section [4] we present parameter estima- 
tion according to likelihood approach; in Section [5] we describe the EM algorithm 



3 



for parameter estimation in GLCWM; in Section [6] we introduce some measures for 
performance evaluation; in Section [7] we illustrate numerical studies based on both 
simulated and real data. Finally, conclusions and perspectives for future research are 
presented in Section[8] 



2. The model 



We consider a broad family of models in which, for each Sl g , the conditional dis- 
tributions are assumed to belong to the exponential family with canonical links, that 
is 



P(y\x; L) = q(y\x; /3 Xg) = cxp 



yr]{x;f3 g ) - b [r/{x; (3 g )] 
a(X g ) 



+ c(y,X g )}, (3) 



where a(-), &(•), and c(-) are specified functions, A g is the dispersion parameter, with 
a-(Xg) > 0, and r)(x;/3 g ) = rj a = /?p + &\„x is th e canonical function (see, e.g., 



Wedel and De Sarbo 



1995 



and 



McCullagh and Neldei 



2000|). In particular, &(•) is the 



cumulant function, while a(-) and &(•) satisfy 

l i g = E(Y\x;[3 g ,X g ) = b'(r 1g ) and a 2 g =Y(Y\x;f3 g ,X g ) = a(X g )b"( Vg ), 

where b'(r) g ) and b"(rj g ) are the first and second derivatives of &(%) with respect to r\ g , 
respectively. Moreover, r\ g is related to the expected value fi(x; (3 g ) = p g , through a 
link function h(-), in the following way 

Vg = h(jJ, g ). 

For sake of simplicity, we assume Gaussian distributions for marginals, i.e. X|f2 g ~ 
Nd(fi , E fl ). Thus, equation ([TJ becomes 

G 

P(x, y;6)=Y^ l(y\ x ' Pgi x g)<t>d{x; n g , ^g)^g- (4) 

3=1 

Model will be referred to as generalized linear Gaussian CWM (GLGCWM) here- 
after. Canonical links are the identity, log, logit, inverse and squared inverse func- 
tions for the normal, Poisson, binomial, gamma and inverse Gaussian distributions (see 



McCullagh and Neldenl2000i Table 2.1). Thus, all these distributions can be taken into 
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account for modeling the Y variable in each fl g . In particular, the Poisson and the Bi- 
nomial distributions are quite useful because they allow the CW principle to be applied 
also for discrete response variables Y; this is the reason why, in the following, we shall 
focus mainly on such distributions. 

It is interesting to note that, from a clustering/classification point of view, the pos- 
terior probabilities of group membership 

p (n g \x,y;0) = »— — -2 , g = l,...,G, (5) 

p(x,y;0) 

depend on both the marginal and conditional component distributions, differently from 
the standard mixture models. 

2.1. The Binomial Gaussian CWM 

Assume that Y takes values in {0, 1, ... , M}, for some M £ N. Moreover, as- 
sume that the probability mass function of Y\x in il g is binomial with parameters 
(M, r ) g {x)), that is Y\x, il g ~ Bin(M, r y g (x)). In this case 



q(y\x;M,p 0g ,(3 g ) = {^j [ lg {x)]y [1 - lg (x)] M - y , 
where the probability "f g (x) is postulated to depend on x through the function 



(6) 



exp(/3 0g + /3 x) ( 7 ( a \ 

Jg(x) = — — y —, or, equivalently, In — - = f3 Qg + fi x, 

1 + exp(/3 g + P g x) \l--/ g (x)J 

(7) 

with /3o g € 1R and (3 S R d being parameters to be estimated. When © is considered 
in ([T]), we have Binomial Gaussian CWM or, more simply, Binomial CWM. Then, 
Bernoulli CWM results, as a special case, when M = 1. 

2.2. The Poisson Gaussian CWM 

Assume that Y takes values in N. Moreover, assume that the probability mass 
function of Y\x in il g is Poisson with parameter X g (x), that is Y \x, fl g ~ Poi(A 9 (a;)). 



In this case 



q (y\x;p 0g ,(3 g )=expi-\ g (x)} [ ^^, (8) 
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where A g (x) is postulated to depend on x through the function 

X g (x) = exp(/3 0s + (3' g x) or, equivalently, ln[A s (x)] = f3 0g + (3' g x, (9) 

with (3q 9 G R and f3 g £ TR d being parameters to be estimated. When ((HJ is considered 
in (|T), we have Poisson Gaussian CWM or, more simply, Poisson CWM. 



3. Relationships with finite mixtures of generalized linear models 



In th is section, we extend results given in llngrassia et al.\ (12012al) . Ilngrassia et al. 



d2012bl) to the generalized linear Gaussian CWM. Proofs are given in the Appendix. 
Let 

G 

f(y\x;K) =^2q(y\x;(3 g ,\g)ir g (10) 



9=1 



be a finite mixture of genera l ized linear model (FMGLM ), see e.g 



McLachlan 



1997) 



McLachlan and Peel 



(2000), 



Wedel and De Sarbol d 1995b where k = {/3 , X g ,it g ;g 



1, . . . , G} denotes the overall parameters of the model. According to this model, the 
posterior probability that the generic observation (x 1 , y)' belongs to il g is 



,G. 



(11) 



f{y\x;n) 

Proposition 1. If the marginal distributions <fid(x; fi g , S s ) of X\fl g do not depend on 
Vl g , i.e. fi g , "Eg = /j,, S, g = 1, . . . , G, then model dH) becomes 

P(x, y, 0) = M x - ^> S)/(j/|aj; k), 
where f(y\x; k) was defined in dlOl l. 

Corollary 2 (from Proposition [Jl. Under the assumptions of Proposition Q] the pos- 
terior probability that the generic observation (x' , y)' belongs to Q g from model dill 
coincides with dTTb . 

Remark. We remark that result in Proposition[T]holds in a quite more general context. 
As a matter of fact, the proof does not require any distributional assumption on the 
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marginal distributions, but it needs only that the marginal distributions X|f2 s do not 
depend on group g, i.e. ip g = if) for every g = 1, . . . , G. Thus, the model ([TJ yields: 

p(x, y; 0) = p(x; i/>)f (y\x; k) , 

where f(y\x; <fi) has been defined in ( TTOb . 

An extension of model ( TTOb concerns the case with concomitant va riable; this leads 
to finit e mixtures ofGLMs with concomitant variables (FMGLMC), see lGriin and Leisch 
J2OO8I) 

G 

f(y\ni<p) =^2<l{y\x;(3 g ,X g )p(^g\x;a g ), (12) 

where the mixing weight p(il g \x; a g ) is now a function of x through some parameters 
a, and ip denotes the overall parameters of the model. Note that, in the general formu- 
lation of model ( fT2T > given by [ 



Griin and Leisch 



(2008, p. 3), the concomitant variables 



could be also exogenous to X. The probability p{VL g \x\ a g ) is usually modeled by a 
multinomial logistic distribution with the first component as baseline, that is 

exp(a g0 + oc' g \x) 



p(n g \x;a g 



(13) 



exp(a j0 + ol'^x) 



where a g = (a g0 , cx! gl )' . According to model ( fT2] i. with the specification of p(Cl g \x; a g ) 
given in $13[ , the posterior probability that the generic observation (x' , y)' belongs to 



Q g is 



P(tt g \x,y) 



q(y\x;f3 g ,X g )p{fl g \x;a g 



<l{y\ x ; P„, exp(a s0 + a gl x) 



g\-i«) G G 

Y^q{y\x; (3j, \j)p(Slj\x; "?) q ( y \ x > 3j,Xj)exp(a j0 + a'^x) 

(14) 

Proposition3. If in it results X\Q g ~ i\T(j(/i fl ,I3) and 7r 9 = 7r = 1/G, 5 = 
1,...,G, then 

p(aj, y; 0) = ip)f{y\x; <p), 

G 

where f(y\x; ip) is defined in (fT2l through (fT3l and v(x; ip) = <^d(a;; /J q , £). 

9=1 
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Corollary 4 (from Proposition |3). Under the assumptions of Proposition [3] the pos- 
terior probability that the generic observation (x', y)' belongs to fl g from model © 
coincides with ( fT4l ). 



4. Likelihood function and parameter estimation 

Let (xi,yi), . . . , (xjv, 2/at) be a sample of N independent observation pairs drawn 
from model in (fl~|i and set X = (x[, . . . ,x' n ), Y = (yi, . . . , y n ). The likelihood 
function of the generalized linar Gaussian CWM (|4) is given by: 



N 



N 



L (d;X,y) = l[p(x n ,y n ;0) = \ 



.9=1 



Maximization of Lq(6; X ,y) with respect to 6 yields the maximum likelihood es- 
timate of 8. Previous resu lts under Gaussian assumptions have been presented in 



Ingrassia and Minottil (120121) . Let us consider fully categorized data: 



{w n : n = l,...,N} = {(x n ,y n ,z n ) : n = 1,.. .,N}, 

where z n — (z n i, . . . , z ng )' , with z ng = 1 if (x n , y n ) comes from the g-th population 
and z ng = otherwise. Then, the complete-data likelihood function corresponding to 
W = (wi, . . . ,wn) can be written in the form: 

N G 



L c (0:X,y) = J] ]J [q(y n \x n ; p g , X g )] z ^ [<f> d (x n] ttJ 

™=1 9=1 



(15) 



Taking the logarithm of ( fT3T > after some algebra we get: 
C c (6-X,y) =\nL c (8;X,y) 

N G 

= ^2 ^[zng \nq(y n \x n ;P gl \ g ) + z ng In <j> d {x n \ fi g , S ff ) + z„ ff ln7r g ] 
n=l g=l 

JV G iV G 

7 Mo s ^g )' 

n—1 g—1 n—1 g—1 

N G 

+ J2J2 Zn 9 ln7r 9 

n=l 9=1 



Ac(C) + -CacW+£3c(ff), 



(16) 
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where £ = {/3 , X g ;g = 1, . . . , G} and i/> = {/z , S 9 ; g = 1, . . . , G}. We remark 
that in ( fT6] l the weights 7r are estimated through the posterior probability (|5), see Sec- 
tion[5]for details. 

Relationship with the log-likelihood function of FMGLM. In the following we 
show that, under suitable hypotheses, the maximization of the likelihood function of 
GLGCWM in © leads to the same parameter estimates of FMGLM in ( fTUt . Indeed, 
based on dTOb . the log-likelihood is given by 

N G 

1 1 

n=l g=l 

N G N G 

= ^2^2znglnq(y n \x n ;f3 g ,Xg) + ^ ^ z ng lnvr g 

n—1 g—1 n—1 g—1 

= £l c (0+Ac(7T). (17) 



Proposition 5. In model ©, if the local densities (f)d(x n ; fi g , S 9 ) have the same pa- 
rameters (fig, £ g ) = (fi, S) for g = 1, . . . , G, then maximum likelihood estimate of 
(£, 7r) in (T% coincides with the corresponding estimate in (I161 l. 

Remark. We remark that the above result can be generalized like in the previous case. 
As a matter of fact, the proof of Proposition|5]does not require the Gaussian assumption 
on the marginal distribution. For any marginal distribution p(-;ip g ) (depending on 
some parameter ip g ), if the local densities p(x n ; ip g ) have the same parameters ip g = ip 
for g = 1, . . . , G, then maximum likelihood estimate of (£, 7r) in (fTTI i coincides with 
the corresponding estimate in ([Tol . 

Relationship with the loglikelihood function of FMGLMC. Now let us analyze the 
relationships with FMGLMC. In particular, we show that, under suitable hypotheses, 
the maximization of the likelihood function of GLGCWM in leads to the same 
parameter estimates of FMGLMC in dl2l . Indeed, let us consider the density function 
in dl2l . where the mixing weights p(fl g \x, a g ) are given in (TT~3T >. The corresponding 



9 



complete-data log-likelihood function is given by: 

N G 

£c(0;X,y) = y^y^ [z ng \nq(y n \x n ;f3 g ,X g ) + z ng hxp(Q, g \x, a g )] 

n=l g=l 

= £ lc (Z) + £ Sc (Q, (18) 
where C = {<* g ', 9 = 1, • • • , G}. 

Proposition 6. In model assume that the local densities have the same covariance 
matrices, i.e. 4>d(x; fi g , S s ) = 0d(£c; X) and the prior probabilities be equal, i.e. 
TT g = 1/G for g = 1, . . . , G. Then, the maximum likelihood estimate of (£, £) in ( fT8l 
can be derived from the estimate of (£, ■0) in ( [Tol l. 

4.7. Discussion 

The above results show that both models FMGLM and FMGLMC can be consid- 
ered as nested models of CWM in (Q3 even if they have a different structure. As a mat- 
ter of fact, model (TTOb and (fT~2b consider only conditional distributions, while CWM 
considers joint distributions (as a product of marginal and conditional distributions). 
However, we remark that: 

• if the marginal distributions p(x; fi g , S 9 ) in (Q]i do not depend on the gth group, 
i.e. p(x; fi g , Tl g ) = p(x; fx, S) (g = 1, . . . , G), then the estimates {/3 A g , 7r g , g 

1 G} in ([T]) and ( TTOb are the same according to Proposition|5] in other words, 

if fi g , H g — fi,T (g = 1, . . . , G), then the parameters of the conditional dis- 
tributions in GLGCWM and FMGLM are the same. Moreover, the posterior 
probability <(3j reduces to ([TTV An empirical analysis based on simulated data 
will be presented in Example[T]of Section|7] 

• If the marginal distributions p(x; fi g , S ff ) in (Q~|) are Gaussian with the same co- 
variance matrices, i.e. p(x; fi g , S ff ) = 4>d{x; fi g , S) and the mixing weights are 
equal, i.e. ir g = l/G(g = 1, . . . , G), then the estimates {/3 , X g , g = 1,...,G} 
in (HJ and (TT2l) are the same according to Proposition |6] moreover, the posterior 
probability (O reduces to (fT4t . 
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5. The EM algorithm for the Generalized Linear CWM 

In this section, we present the main steps of the EM algorithm for parameter es- 
timation of the Generalized Linear CWM in ©. In this case, the three terms of the 
complete-data log-likelihood function ( [T6b are given by: 



£ic(£) = EE 2 "9 

n=l 9=1 
N G 



ry nV (x n - 7 [3 g )-b( V (x n -f3 g )) 

— — + c(y,A 9 ) 



a(X g ) 

C 2c {i>) = EE Z ™«4 [-pln27r-ln|S g | - (x n - /x ff )'S~ 1 (x n - fi g 



'2 
n=l s=l 

JV G 

£ 3 c(7T) = ^^2 ng [ln7T g ]. 

n=l g=l 

The E-step on the (fc+l)-th iteration of the EM algorithm requires the calculation of 
the conditional expectation of the complete-data log-likelihood function C c (6; X, y) 
in (O, say Q(6, (k) ), evaluated using the current fit {k) for 9. Since £ c (0; X, y) 
is linear in the unobservable data z ng , this means calculating the current conditional 
expectation of Z ng given X and y, where Z ng is the random variable corresponding 
to z ng , that is 

Q(e,e^)=E e(k) {C c (G;X,y)} 

N G 

= E E E ^> {Z ng \x n ,y n }[Q 1 (f3 g , \ g ;0W) + Q 2 Qu 9 , S s ; W ) + ln7r fl 

n=l g=l 
AT G 

= EE r ^ ) ^i(/ 3 9 ' A 9^ (fc) ) + ^(M s ,5] ff ;0W)+ln7r 5 ], (19) 

n=l g=l 

where Qi((3 g , \ g ;6^) and Q2{^ g , S s ; 0^) depend on the functional form of densi- 
ties p(y n \x n ;(3 g , X g ) and p(x n ;n g , S ff ), respectively: 

Qi Aff; ,«) = ^(^n;^) - ^;^)) + ^ (20) 

Q 2 (/x 3 ,S 9 ;0 (fc) ) = i [-pln27r-ln|E 9 | - (^-^'EJ 1 ^-^)] ■ (21) 

The M-step on the (fc + l)-th iteration of the EM algorithm requires the maximization 
of the conditional expectation of the complete-data log-likelihood Q(0, 6^ k ') with re- 
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spect to 6. The maximization of the quantity in d20b is equivalent to the maximization 
problem of the generalized linear model for the complete data, except that each obser- 

(k) 

vation y n contributes to the log-likelihood for each group g with a known weight Tng . 
The stationary equations are obtained by equating the first order partial derivatives of 
En=i Eli r$Qi(/3 sl A ff ; 0< fc >) to zero, that is 



„ N G , . p. 

O (k) Vnllv^nj Pg 

~dP~ ^ ^ Tng 

U ^9 n =l g=l 



a(X g 







_d_ 

dX r 



N G 



EE^ 



(k) y^( X n-Pg) - b(v( X n;f3 g )) _ 



" n—1 g—1 

which after some algebra yield 

V^V^ (k) yn-b'(r)(x n ;P g )) 



a(X g ) 







n=l 9=1 
N G 

EE' 

n=l 9=1 



1 



a(Xg) Ullg Vng 

[y n r](x n ;P ) - b(j]{x n \P ))] d 



dX t 



d 

a(A 9 ) + g^c(y n ,X g ) = 0. 



Maximization can be obtained by the iterative reweighted least-squares procedure by 
Nelder and Wedderburnl (119721) for ML estimation of generalized linear models, with 



each observation y n weighted additionally with T^ k J . 

For the maximization of the quantity in (fJTJ, the solutions for the weights K g k+1 ^ 
and the parameters estimates /Li ff , £^ fe+1 ) of the local densities are given in closed 
form, that is: 



1 N 

(fe+i) = £y (k) 

71 = 1 



Vg 



y(k+l) 
9 



E 



(fe) 



E 



En=l T "ff J ( :E ™ f^a ^X 33 " 1^9 > y 



(k) (fe) 

.(fe) 



(*=+!)>, 



N (fe) 



see e.g. 



McLachlan and Peel 



n=l 



d2000l) . In the rest of the section, we present two compu- 



tational issues. 



Algorithm initialization. The algorithm has been initialized by assigning an initial 



classification of the units, that is by specifying a value for z 



(o) 



1, . . . ,N (see, 



12 



e.g., 



McLachlan and Peel 2000) 



Convergen ce criterion. T he convergence criterion is based on the Aitken acceleration 



procedure (lAitken 



19261) which is used to estimate the asymptotic maximum of the 
log-likelihood at each iteration of the EM algorithm. Based on this estimate, a decision 
can be made regarding whether or not the algorithm has reached convergence; that is, 
whether or not the log-likelihood is sufficiently close to its estimated asymptotic value. 
The Aitken acceleration at iteration k is given by 

iCw-D _ l(k) 



l(k) _ i(k-i) ' 

where /0+ 1 ), l( k \ and l^'^ are the log-likelihood values from iterations k + 1, k, 
and k — 1, respectively. Then, the asymptotic estimate of the log-likelihood at iteration 
k + 1 is given by 



1 



see 



Bohning et al. 



1 - a« 



(119941) . In the analyses in Section [7] the algorithms stopped when 



j(fc+i) _ t (k) < e> whh £ = o.05. 



6. Performance evaluation 

In order to evaluate the performance of the models introduced in the paper, some 
different indices will be taken into account. They are classified according to the type 
of the available data. In general, model will be selected according to the BIC, see 16. II 
Moreover, when data are simulated and the data labeling is known we can use indices 
which compare the true partition with that arising from the application of a particular 
model (like the misclassification error and indices described in Section I0T2I . On the 
contrary, when we do not know the true labels, we limit our attention to goodness-of-fit 
indices (see Section l631 and Section l6~4l i. 

6.1. Model selection and performance 

In our nume rical studies, we considered the Bayesian information criterion (BIC) 



Schwarjdl978l) 



BIC = 21(0) -mlnJV. 
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where is the ML estimate of 6, 1(9) is the maximized observed-data log-likelihood, 
and m is the overall number of free parameters in the model. 

6.2. The Rand index and the adjusted Rand index 



When the true classification is known, often the adjusted Rand index (ARI; 



Hubert and Arabie 



Rand 



1985) is co nsidered as a measure of class agreement. The original Rand index (RI; 



1 97 ID can be expressed as 

number of agreements 



RI 



number of agreements + number of disagreements ' 

where the number of agreements (observations that should be in the same group and 
are, plus those that should not be in the same group and are not) and the number of 
disagreements are based on pairwise comparisons. The RI assumes values between 
and 1, where indicates no pairwise agreement between the MAP classification and 
true group membership and 1 indicates perfect agreement. 

One criticism of the RI is that its expected value is greater than 0, making smaller 
values difficult to interpret. The ARI corrects the RI for chance by allowing for the 
possibility t hat classification perform ed randomly will correctly classify some obser- 



vations, see 



Hubert and Arabie 



( j 1985b . The ARI can be expressed as 
RI - E(RI) 



ARI = 

max(RI) - E(RI) ' 

where the expected value E(RI) of the Rand Index is computed considering all pairs of 
distinct partitions picked at random, subject to having the original number of classes 
and objects in each. Thus, the ARI has an expected value of and perfect classification 
would result in a value equal to 1 . 

6.3. The index of conditional goodness-of-fit 

We introduce a descriptive measure of goodness-of-fit which is directly related to 
the generalized Pearson x 2 statistic commonly used for generalized linear models. In 
detail, our index of conditional goodness-of-fit (CGOF) is defined as 



2 -. iv Lt 



y n -E(Y\x n ,n g )] 



N ,Ly Y(Y\x n ,n g ) 



(22) 
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As a special case, wh en G = 1, y 2 „ corresponds to the class ical generalize d Pearson 



X 2 statistic (see, e.g. 



McCullagh and Neldei 



2000, p. 34 and 



Olsson 



2006, p. 46). In 



the quantity \w i s divided by TV in order to remove the impact of the sample size. 
Furthermore, the division by the conditional variance Y(Y\x n , f2 g ) makes the squared 
residuals [y n — K(Y\x n , ft g )} 2 comparable between groups. 

6.4. An index of goodness-of-fit based on scaled deviance 

Another goodness- of-fit criterion consists of an e xtension of the traditional scaled 
deviance for GLM (see 



McCullagh and Nelder 2000, p. 34), which is defined as: 



SD = D*(y; M ) = 2l(y;y) - 2l(p,y), 



(23) 



that is the difference between the maximum likelihood achievable in a full model and 
that achieved by the model under investigation. Note that 



D*(y;A)=D(»;A)M 



(24) 



where D(y; y) is the deviance for the current model, so that the scaled deviance is the 
deviance expressed as a multiple of the dispersion parameter. 

Given the forms of the deviances f or Binomial and Poisson distributions, respec- 
tively (see lMcCullagh and Nelden2000i p. 34), the corresponding measures for GLGCWM, 



called generalized deviances, can be defined as: 

JV G 



1\ s 

GD Bin = 2 ^2 ^2 * n 9 1 Vn log 

n=l g=l 
JV G 

2 EE 



Vn 



2/nlog 



E{Y\x n ,n g ] 

Vn 

E{Y\x n ,Sl g ) 



+ {K- y n ) log 



M - y n 



M -E{Y\x n ,Cl g ) 
[y n -E{Y\x n ,n g )) 



- g 

N G 

GD Poi = 2 J2 E 2 
n=l g=l 

Analogous forms are obtained for the generalized scaled deviances, by dividing the 
general deviances for the dispersion parameter, which is 1/M for the Binomial distri- 
bution and 1 for the Poisson distribution, respectively, that is 

GSD Bin = GD Bin /M 
GSDp i = GDp i. 
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7. Numerical studies 

In this section we illustrate some features of the models we proposed above, on 
the ground of numerical studies based on both artificial and real data. We aim also 
at a comparison with the competitive models, that is FMGLM and FMGLMC. Code 
for all of the analyses presented h erein was written in the R computing environment 



(IR Development Core Teaml201 ll) . We remark that the parameter estimation of FMGLM 



and FMGLMC has been carried out by means of the R-package f lexmix, see 



Leisch 



(2004). 



Griin and Leischl (|2008 ) . 



7.1. Simulated data 

Example 1 (Estimates comparison between FMGLM and constrained GLGCWM). 

In Section|4]we stated that FMGLM can be considered as nested models of GLGCWM. 
In particular, we showed that if fi„, S g = fi, £ (g = 1, . . . , G), then the conditional 
distributions in GLGCWM and FMGLM have the same estimates of { /3 g , X g , ir g ; g = 1 , . 
Thus, first numerical simulations concern the comparison between such estimates in 
data modeling according to either Poisson Gaussian CWM and mixture of Poisson 
regressions when the marginal distributions do not depend on the <?th group (in the fol- 
lowing, this case will be referred to as constrained Poisson Gaussian CWM). Here we 
considered G = 2 groups. 

The data have been generated as follows. As for the marginal distributions, we 
considered three cases: X ~ N(5, 0.8), X ~ Unif(4.4, 5.5), and X ~ Unif(4, 5). In 
particular, we remark that two out three cases concern non Gaussian marginal distri- 
butions. As for the conditional distributions, data have been generated according to a 
Poisson distribution ([8]l with the following parameters: /3oi = 1 and j3n = 0.2 in the 
first group, /?02 = and /3i2 = 0.6 in the second group. For each combination of the 
parameters, we have generated 120 random samples with sample sizes iVi = 250 and 
N 2 = 350. 

In each replication, once both the models were fitted to the generated data, the 
following index was computed to evaluate the discrepancy 

EE|^ LGCWM -^ MGLM |/4, (25) 

i=l j=0 
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where /3° LGCWM and /3™ GLM are the ML estimates of for the Poisson GCWM and 
the Poisson GFMR, respectively. For each of the three considered X-distributions, the 
average values, with respect to the 120 replications, of the index in d25l l were respec- 
tively: 0.00131 when X ~ N(5, 0.8), 0.00386 when X ~ Unif(4.4, 5.5), and 0.00259 
when X ~ Unif(4, 5). These results underline as the constrained Poisson GCWM 
reveals to be a very good approximation for the Poisson GFMR, regardless from the 
distribution of X. 

Example 2 (Data from a Binomial Gaussian CWM). The TV = 600 artificial bivari- 
ate data of this example are referred to G = 2 groups of size Ni = 250 and 7V2 = 350. 
They are randomly generated from a Binomial Gaussian CWM, with M = 30, with 
parameters given in the first row of Table Q~](here "true parameters" means the param- 
eters we used for data generation). Figure [Tjdisplays the scatterplot of the data and the 



Table 1: Examplef2] Parameters and estimates according to the Binomial Gaussian CWM. Standard errors 
are given in round brackets 







7T 2 


/ii 


/'2 


<ri 


CT 2 




An 




012 


true parameters 


0.417 


0.583 


2.000 


-2.000 


1.000 


1.000 


0.000 


0.000 


1.000 


1.000 


estimates 


0.427 


0.573 


1.933 


-1.986 


0.996 


0.936 


0.082 


-0.059 


0.969 


0.941 




(0.028) 


(0.032) 


(0.074) 


(0.059) 


(0.114) 


(0.088) 


(0.081) 


(0.071) 


(0.047) 


(0.040) 



fitted models. Figure |2] displays the joint density from a Binomial CWM for different 
values of y. In Table|2]we present the ARIs and the misclassification errors we obtained 
in the three data modeling. Here, it is possible to see as the Binomial Gaussian CWM 
outperforms the other approaches; in particular, we remark that the Binomial MR is 
not able to capture the underlying group-structure of the data. 

Table 2: Example\2\ Clustering performance of the fitted Binomial-based models 





FMR 


FMRC 


CWM 


misclassError 


41.67% 


2.67% 


2.00% 


ARI 


-0.00078 


0.89595 


0.92143 



Example 3 (Data from a Binomial Gaussian CWM.). The N = 250 artificial bi- 
variate data of this example are referred to G = 2 groups of size N± = 100 and 
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(a) TRUE 



(b) Binomial FMR 




(c) Binomial FMRC (d) Binomial CWM 

Figure 1: Example|2] Scatter plots and Binomial-based models. 

N2 = 150. They are randomly generated from a Binomial Gaussian CWM, with 
M = 30, with parameters given in the first row of Table [3] Figure |3]displays the scat- 
terplot of the data and the fitted models. In Table|4]we list the ARIs and the misclassifi- 
cation errors we obtained in the three data modeling. Like in Example[3] the Binomial 
Gaussian CWM outperforms the other approaches but, differently from the previous 
case, here the Binomial Gaussian CWM clearly outperforms the FMRC. Moreover, it 
is possible to see like the clustering results for the FMRC are the worse than those 
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Figure 2: Example[2] Joint density from the fitted Binomial CWM. 



Table 3: Example|3] Parameters and estimates according to the Binomial Gaussian CWM. Standard errors 
are given in round brackets 







7T 2 


I'l 


1*3 


fi 


(72 


An 


002 


0u 


012 


true parameters 


0.400 


0.600 


0.000 


0.000 


1.000 


16.000 


0.000 


0.000 


2.000 


0.500 


estimates 


0.388 


0.612 


0.069 


-0.277 


0.969 


14.252 


0.081 


-0.059 


1.952 


0.494 




(0.044) 


(0.053) 


(0.106) 


(0.305) 


(0.159) 


(1.688) 


(0.058) 


(0.042) 


(0.089) 


(0.015) 



obtained via FMR. 

Table 4: Example^ Clustering performance of the fitted Binomial-based models 





FMR 


FMRC 


CWM 


misclassError 


19.60% 


20.80% 


7.60% 


ART 


0.36436 


0.33591 


0.71775 
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(a) TRUE 



(b) Binomial FMR 




(c) Binomial FMRC (d) Binomial CWM 

Figure 3: Example|3] Scatter plots and Binomial-based models. 

Example 4 (Data from a Poisson GCWM.). The N = 400 artificial bivariate data of 
this example are referred to G = 2 groups of size Ni = 150 and N-2 = 250, respec- 
tively. They are randomly generated from a Poisson CWM with parameters specified 
in the first row of Table [5] Figure |4] displays the scatterplot of the data and the fitted 
Poisson-based models described in the paper. Figure |5]displays the joint density from 
a Poisson CWM. In Table[6]we get the ARI and the misclassification error we obtained 
in the three data modeling. Here, it is possible to see as the Poisson CWM outper- 
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Table 5: Example\4\ Parameters and estimates according to Poisson CWM. Standard errors are given in 
round brackets. 







T'2 


Pi 


M2 






An 




Ai 


ft2 


true parameters 


0.375 


0.625 


0.000 


5.000 


1.500 


0.800 


1.000 


0.000 


0.200 


0.500 


estimates 


0.370 


0.630 


-0.061 


5.016 


1.291 


0.817 


0.985 


0.005 


0.203 


0.497 




(0.031) 


(0.040) 


(0.100) 


(0.060) 


(0.174) 


(0.082) 


(0.051) 


(0.103) 


(0.047) 


(0.019) 




(c) Poisson FMRC (d) Poisson Gaussian CWM 

Figure 4: Example|4] Scatter plots and Poisson-based models. 
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Joint Density 




Figure 5: Example|4] Joint density from the fitted Poisson CWM. 

forms the other approaches; in particular, the Poisson FMR is not able to capture the 
underlying group-structure of the data. 

Table 6: Example^ Clustering performance of the fitted Poisson-based models 





FMR 


FMRC 


CWM 


misclassError 


37.50% 


6.50% 


0.50% 


ART 


-0.00378 


0.75612 


0.97997 



7.2. Real data 



Example 5 (Coupon redemption dat a.). The fol l owing example is based on the coupon 



redemption data analyzed in 



Wedel and De Sarbol (119951) . see also 



Wedell d2000h . The 



sample size is N = 270. The response variable Y is the number of yogurt coupons 
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redeemed by consumers during a period of 104 weeks and the predictor variable X is 
the average price paid for ounce. We ran both constrained and (unconstrained) Poisson 
Gaussian CWM, Poisson FMR and Poisson FMRC. The number of groups associated 
with the largest BIC is G = 2. 

Constrained Poisson Gaussian CWM outperforms the unconstrained model (BIC 
resulted —1952.12 and —1992.64, respectively) while Poisson FMR slightly outper- 
forms Poisson FMRC (BIC resulted —783.71 and —788.55, respectively), where we 
recall that the BIC values cannot be compared directly between the two classes of mod- 
els, due their different nature (see Section HTTV The constrained Poisson CWM and the 
Poisson FMR attained very similar values of GCOF and GSD measures of goodness 
of fit (see Table[8]l. Moreover, they substantially provide the same parameter estimates 
and the same classification (see Table [7] and Figure [6] This confirms again theoretical 
results proved in Section |4] 

group coefficients Constrained Poisson Gaussian CWM Poisson FMR 
1 -1.122 -1.134 

fti 0.025 0.026 



A)2 
012 



-0.435 
0.250 



-0.445 
0.250 



Table 7: Coupon redemption data. Coefficients of Constrained Poisson Gaussian CWM and Poisson FMR. 



Example 6 (Patent data.)- Patent data have been studied in IWang et al 



dl998h and 



are available in the R Flexmix library. They consist of N = 70 observations on 



Constrained Poisson Gaussian CWM Poisson FMR 
GGOF 1.126 1.127 

GSD 279.268 278.886 



Table 8: Coupon redemption data. Generalized weighted Pearson chi-square statistic and generalized scaled 
deviance in Constrained Poisson Gaussian CWM and Poisson FMRC. 
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Constrained Poisson Gaussian CWM Classification 




average price 



Poisson FMR Classification 




average price 



Figure 6: Coupon redemption data. Reclassified data (constrained Poisson Gaussian CWM and Poisson 
FMR). 

patent applications and R&D spending in millions of dollars from pharmaceutical and 
biomedical companies in 1976. The number of patent applications is the response 
variable and the R&D spending is the continuous predictor variable. The number of 
groups associated with the largest BIC is G = 3. 

Here, the unconstrained Poisson Gaussian CWM outperforms the constrained model 
(BIC resulted -774.57 and -793.36, respectively), while Poisson FMRC slightly out- 
performs Poisson FMR (BIC resulted -438.19 and -441.06, respectively). The uncon- 
strained Poisson CWM and the Poisson FMRC attained very similar values of GCOF 
and GSD measures of goodness of fit (see Table [TOl. Again, they substantially provide 
the same parameter estimates and the same classification (see Tableland Figure|7]i. 

Example 7 (Healthcare data.)- This case study is based on administrative data con- 
cerning the healthcare system in the Italian Lombardy region. The sample consists 
of TV = 332 patients on which "length of stay" (in days, count response variable Y) 
and "age" (covariate X) are measured. Only living persons at the end of the hospital- 
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group coefficients Poisson Gaussian CWM Poisson FMRC 



1 


An 


2.383 


2.396 






0.569 


0.567 


2 


A)2 


0.782 


0.803 




Pl2 


0.825 


0.821 


3 


A)3 


-1.812 


-1.801 




013 


1.410 


1.404 



Table 9: Patent data. Coefficients estimates of Poisson Gaussian CWM and Poisson FMRC. 



Poisson Gaussian CWM Classification 




-2 2 

logarithmized R&D spending 



Poisson FMRC Classification 




-2 2 4 

logarithmized R&D spending 



Figure 7: Patent data. Reclassified data (Poisson Gaussian CWM and Poisson FMRC). 

ization have been considered. Two Diagnosis-Related Groups (DRGs), corresponding 
to the codes 348 and 396 are considered, they define two groups of size Ni = 164 
and Ni = 168, respectively. Figure [8(aJ| displays the scatterplot of the data; more- 
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Poisson Gaussian CWM 


Poisson FMRC 


GGOF 


1.283 


1.277 


GSD 


89.659 


87.339 



Table 10: Patent data. Generalized weighted Pearson chi-square statistic and generalized scaled deviance in 
Poisson Gaussian CWM and Poisson FMRC. 

over, in Figures |8](b)-(d) the fitted Poisson-based models FMR, FMRC and CWM and 
the corresponding data clustering are plotted. The ARI and the misclassification er- 
ror obtained in the three data modeling are given in Table QT] The results show that 
the Poisson CWM clearly outperforms both FMR and FMRC, and it yields a perfect 
separation between the two classes. On the contrary, the Poisson FMR and FMRC are 
not able to capture the evident underlying group-structure of the data. We remark that 

Table 1 1 : Example\7\ DRG: 348 and 396. Clustering performance of the fitted Poisson-based models on the 
Healtcare data 





FMR 


FMRC 


CWM 


misclassError 


45.48% 


28.01% 


0.00% 


ARI 


0.00531 


0.19098 


1.00000 



we carried out other analyses considering patients based on other pairs DRGs. Here, 
the Poisson CWM outperformed both FMR and FMRC, even if in many cases it didn't 
yield a perfect separation between classes. The results obtained in clustering data com- 
ing from other pairs of DRGs are given in Table [T2l 

Table 12: Example^ Clustering performance of the fitted Poisson-based models on the Healtcare data 
(other pairs of DRG). 

Couples of DRGs FMR FMRC CWM 

(68,317) 34.67% 37.00% 30.00% 

(345,68) 41.97% 43.00% 34.47% 

(345,168) 43.81% 44.48% 36.12% 
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(a) TRUE (b) Poisson FMR 



(c) Poisson FMRC (d) Poisson CWM 

Figure 8: Example[7j DRG: A and B. Scatter plots and Poisson-based models for the Healthcare data. 

7.3. Discussion 

The results of the numerical studies given in this section emphasizes the effective- 
ness of the GLGCWM in comparison with some finite mixtures of regression models. 
Indeed the results of the above examples can be summarized as follows: 

Example[TJ here the theoretical results given in Section|4]have been investigated from 
the numerical point of view and the constrained Poisson GCWM reveals to be a 
very good approximation for the Poisson GFMR, regardless from the distribution 
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of X. 

Example|2j here FMRC strongly outperformed FMR and CWM gave comparable re- 
sults obtained using FMRC. 

Example|3j here FMR and FMRC gave comparable results while CWM strongly out- 
performed both FMR and FMRC. 

Example ID here FMRC strongly outperformed FMR and CWM outperfomed FMRC. 

ExampleHJ here FMR slightly outperformed FMRC and CWM gave comparable re- 
sults obtained using FMR. 

ExamplelH here FMRC slightly outperformed FMR and CWM gave comparable re- 
sults obtained using FMRC. 

Example E here CWM outperformed both FMR and FMRC. 

In order to deepen such results, in Figures |9| and [TOl we show the distribution along 
the covariate X for both simulated and real data. Figures [9}a, c) (Examples [2] and 
@]i and Figure [TOl c) (Example |7]l highlight that CWM yields better performance than 
FMR when the covariates present a clear group-structure; Figure |9jb) (Examples |3]l 
highlights that CWM yields better performance than FMRC when distributions of the 
covariate s are overlapped with d ifferent ranges. This confirm the previous results ob- 



tained in 



Ingrassia et al. 



(I2012al) . On the contrary, when there is no a group-structure 
in the covariates, then CWM, FMR and FMRC give comparable results, see Figures 
QIIa,b) (Examples|5]and|6). 



8. Concluding remarks 

In this paper we have introduced the generalized linear Gaus sian CWM (GLGCWM) 
and we have shown that they are quite flexible statistical models. Such models al- 
low modeling categorical variables depending on numerical covariates based on data 
coming from a heterogeneous population. Some relationships with finite mixtures of 
generalized mixture models (with and without concomitants) have also been investi- 
gated; in particular, we have shown that mixtures of generalized linear models can be 
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-6 -4 -2 2 4 6 




(a) ExamplefS] Binomial-based model. 



(b) ExamplefJ] Binomial-based model. 




(c) Example|4] Poisson-based model. 
Figure 9: Simulated data examples. Distributions of the X variable. 



considered as nested models in GLGCWM even if they have a different structure. Ap- 
plications to real and artificial data have emphasized the effectiveness of the proposal, 
also in comparison with the other models cited above. 

We remark that in this paper we have considered Gaussian marginal distributions. 
However, the extension to multivariate Student-i is straightforward, and in this case 
the parameters of Student-t local densities can b e estimated according to mix tures of 



McLachlan and Peell d2000l) . In this 



multivariate t distributions, see e.g Section 7.5 in 
direction, we are currently working on further extensions of the models presented here. 
In this framework, an important issue concern data modeling by Cluster Weighted ap- 
proaches when the dimension of the input vector X is large. First results, are given in 



Subedi etal. 



(2012). 
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6 8 10 12 14 
average price 



(a) Example \5\ Coupon data 




(b) Example[6] Patent data 




20 



40 



60 
age 



100 



(c) Example[7] Health data 
Figure 10: Real data examples. Distributions of the X variable. 



An issue which deserves attention for future research concerns the behaviour of the 
EM algorithm. It is well known that the EM algorithm suffers for local maxima and 
singularities an d our first simulation studies c onfirmed previous resu lt s given in liter- 



ature, s ee e.g. 



Faria and Soromenho 



Ingrassia et al. 



d2012al) . 



Ingrassia et al. 



(120101) . 

(I2012bl) . showing that the performance of the algorithm strongly depend on the choice 
of initial values. Moreover, suitable constraints on the eigenvalues of the covariance 
matrices of the marginal distributions can be imposed in order to run the algorithm in a 
parameter space with no si ngularities and a reduced number of local maxima, see e.g. 



Ingrassia and Rocci 



(2007) for details. 



Finally, we point out that the above results open the perspective of model based 
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clustering of mixed type data coming from distributions with density p(x,y), where 
X is a mixed continuous variable and Y is either continuous or discrete. 



Appendix: Proofs 



Proof of Proposition^ If fi g7 H g = fi, S, g = 1,...,G, then © becomes 

G G 

p(x, y;0) = Y^ Q.{y\x\ f3 g ,\ g )<j) d {x; fi, S)tt s = (j) d (x; fj,, S) ^ l(y\ x '> Pg> A s) 7r s 

3=1 3=1 

= 4>d{x;n,'S)f{y\x\K). 



□ 



Proof of Corollary^ If (fi g , S g ) = (/x, X), g = 1, . . . , G, then the posterior proba- 
bility that the generic observation (x' , y)' belongs to Vl g from model (0]i specifies as 



3=1 J=l 



q(y\x;P g , \)^ 9 

G 
3=1 



■j 3=1, ■•■,G, 



which coincides with (TTTT) . 



□ 



P roof of P roposition\3\ From © we get: 



P{x,y;9) =^q(y\x;(3 g ,\ g )(l) d (x- 1 ii g ,?:)n = p(x-,ip)^2p(y\x;f3 g ,\ g ). 



5=1 



3=1 



6 d (a;;/x g ,S)7r 
p(x;ip) 



G exp 
9=1 I> X P 



--(x-mJ'S-^sb-M,) 



i=i 
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where 

exp 



--{x-ii g )'jr 1 {x-n g ) 



G 

H eX P 



1 + ^2 exp 



1 + ^2 exp 



1 + ^ exp 



1 

£ exp(--^.S-V j + M^" 1 *) 



(26) 



If we set 



a g0 = -^MgS V g and a' gl = fi' g T, \ g=l,...,G, (27) 
we recognize that (f26t can be written in form ([Tjt . This completes the proof. □ 



Proof of Corollary® If X|Q S - iV d (/j fl , S) and tt 9 = tt = 1/G, .9 = 1, . . . , G, then 
the posterior probability that the generic observation (x' , j/)' belongs to fi g from model 
(O specifies as 

g(2/|x;/3 9 ,A 9 )0 d (a;;/x s ,S) 



p(fi ff |a;,y) 



9 (2/ 1 a ; /3j , Aj ) 4>d (x ; , S ) 



q(y\x;P g , X g )exp [-±(a: - x (a; - /* fl )] 



53?(y|a5;^j, Aj)exp 
3=1 



--(x-MiJ'S-^x-Mj) 
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which can be written as 

P (n g \x,y)= 9(y\x^ g ,X g )e X p(a B0 + a' gl x) ^ 
^2q(y\x;f3 3 , Xj)exp(a j0 + a.' rjl x) 

where a g o and a g i are specified as in i27i . □ 

Proof of Proposition^ In order to prove the proposition, it is sufficient to show that, 
under the assumption that fj, , S g = fi, X, the terms £i c (£) and C^ir) in (TToT l do 
not depend on fi, S. Indeed, if fi gl S 9 = fi, S then the complete-data log-likelihood 
function becomes: 

N G 

C c (6; X, y) = 2J [ z «9 I n e(yn|:E„;/3 9 , A ff ) + 2„ 9 ln</!>d(a;„; s ) + ^n 9 ln7r 9 ] 

n=l g=l 

= Ac(£) + £2cW>*)+£ 3c (7r), (29) 
where -0* = (jit, X) and £2c(ip) in ( fToT l is now replaced by 

N 



£2c(V'*) = ^ln0 d (x n ;/Lt,S) 



since ^2 g=1 z ng = 1 for n = 1,...,JV. Moreover, since (/ti ff ,S s ) = (/tx, E) for 
g = 1, . . . , G, then the posterior probability (0 reduces to 

P[n g \x,y) - Q ■ 

Thus, 2 nff does not depend on 4>d(x n \ fi g , S ff ) and neither does the term £3 C (7r). In 
summary, the maximization of (|29l can be attained by independently maximizing the 
three terms £i c (£), C2c(ip*) and £3 C (7r) and hence, the maximization of dTD i and d29l 
leads to the same estimates of (£, tt). This completes the proof. □ 

Proof of Proposition [6] In order to prove the result, it is sufficient to show that if 
E g = S and n g = 1/G, g = 1, . . . , G, then the terms £i c (£) and £ 3c (7r) in (fT6] > do 
not depend on (/ti ff , £), # = 1, . . . , G. Indeed, we have: 

N G 

L c (9;X,y) = JJ JJ q(y n \x n ; f3 g , A s )^^(x„; E)*»«7r*»« (30) 

n=lfl=l 
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and taking the logarithm of d30l l. after some algebra we get 
C c (G;X,y)=lnL c (e;X,y) 

N G 

= [z ng In q(y n \x n ; f3 g , \ g ) + z ng \n<f> d (x n ; fi g , S)] + tt 

n=l g=l 

= C lc (S) + £ 2 c('<l>**) + ir, (3D 
where ijj** = {n g , £ ; g = 1, • • • , G} and £2c(ip) m ® is now replaced by 

^ N G 

C2c{ip**) = ^ X!E z "s [-pln27r-ln|S| - (je„ - ^ ff )'E _1 (x n - /x g )] . 

n=l g=l 

Once the estimates of (fx , S) have been obtained, quantity p({l g \x, in < f]~8T > can be 
obtained immediately like in d28l ). This completes the proof. □ 
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